Londoners can use a bicycle rental system for commuting within the city every day. These trips can be to get to work or school or to go sightseeing on a pleasant summer day. The number of rented bicycles depends on important factors such as air temperature, air humidity, time of day, etc. In this project, we are trying to predict the number of bicycles that can be rented every hour of the day.
The London bike sharing data available on Kaggle provides information about the bike sharing system in London. The dataset contains information about the bike rides taken by users of the system, including the start and end time of each ride, the duration of the ride, and the start and end station of each ride.
The dataset contains over 17,414 records spanning from 2015-01-04 to 2017-01-03. This data includes information about the weather conditions at the time of each ride, such as temperature, humidity, wind speed, and precipitation. Additionally, it includes information about public holidays and events that may have affected bike usage. The description of each variable in this data is provided as below:
Also. cnt represents the label (the y value) our model must be trained to predict. The other columns are potential features (x values). Therefore, the total number of bike shares (cnt) is the response variable in this data and the main purpose of this report is to investigate the effect of independent variables on the response.
# Define the variables and their titles
variables <- c("timestamp", "cnt", "t1", "t2", "hum", "wind_speed", "weather_code",
"is_holiday", "is_weekend", "season", "year", "month", "day", "hour")
titles <- c("Timestamp", "Count", "Temperature 1", "Temperature 2: Feel-like", "Humidity",
"Wind Speed", "Weather Code", "Is Holiday", "Is Weekend", "Season",
"Year", "Month", "Day", "Hour")
# Create a data frame with the variables and titles
variable_table <- data.frame(Variable = variables, Title = titles)
# Print the table
print(variable_table)
## Variable Title
## 1 timestamp Timestamp
## 2 cnt Count
## 3 t1 Temperature 1
## 4 t2 Temperature 2: Feel-like
## 5 hum Humidity
## 6 wind_speed Wind Speed
## 7 weather_code Weather Code
## 8 is_holiday Is Holiday
## 9 is_weekend Is Weekend
## 10 season Season
## 11 year Year
## 12 month Month
## 13 day Day
## 14 hour Hour
The data is loaded into R using the below R codes and all the required libraries in this report are loaded as well:
library(dplyr)
library(skimr)
library(ggplot2)
library(lubridate)
library(tibble)
library(caret)
library(e1071)
library(rpart)
library(PerformanceAnalytics)
library(xgboost)
library(randomForest)
# set a theme for ggplots
theme_set(
theme_bw(base_size = 15)+
theme(plot.title.position = "plot")
)
# load our dataset
bike <- read.csv(file = "london_merged.csv", header = T)
# print the dimension of the data
dim(bike)
## [1] 17414 10
get a summary:
skim(bike)
| Name | bike |
| Number of rows | 17414 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| timestamp | 0 | 1 | 19 | 19 | 0 | 17414 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| cnt | 0 | 1 | 1143.10 | 1085.11 | 0.0 | 257 | 844.0 | 1671.75 | 7860.0 | ▇▂▁▁▁ |
| t1 | 0 | 1 | 12.47 | 5.57 | -1.5 | 8 | 12.5 | 16.00 | 34.0 | ▂▇▇▂▁ |
| t2 | 0 | 1 | 11.52 | 6.62 | -6.0 | 6 | 12.5 | 16.00 | 34.0 | ▂▅▇▂▁ |
| hum | 0 | 1 | 72.32 | 14.31 | 20.5 | 63 | 74.5 | 83.00 | 100.0 | ▁▂▅▇▅ |
| wind_speed | 0 | 1 | 15.91 | 7.89 | 0.0 | 10 | 15.0 | 20.50 | 56.5 | ▆▇▃▁▁ |
| weather_code | 0 | 1 | 2.72 | 2.34 | 1.0 | 1 | 2.0 | 3.00 | 26.0 | ▇▁▁▁▁ |
| is_holiday | 0 | 1 | 0.02 | 0.15 | 0.0 | 0 | 0.0 | 0.00 | 1.0 | ▇▁▁▁▁ |
| is_weekend | 0 | 1 | 0.29 | 0.45 | 0.0 | 0 | 0.0 | 1.00 | 1.0 | ▇▁▁▁▃ |
| season | 0 | 1 | 1.49 | 1.12 | 0.0 | 0 | 1.0 | 2.00 | 3.0 | ▇▇▁▇▇ |
In order to pre-process the data, we should pay attention to a few aspects of the data including the variable types, number of missing data in each column, scaling the variables (if necessary), excluding unused variables etc.
# add year, month, day and hour to the data
bike <- bike %>% mutate(
year=year(bike$timestamp),
month=month(bike$timestamp),
day=day(bike$timestamp),
hour=hour(bike$timestamp))
Take a closer look at our dataset:
summary(bike)
## timestamp cnt t1 t2
## Length:17414 Min. : 0 Min. :-1.50 Min. :-6.00
## Class :character 1st Qu.: 257 1st Qu.: 8.00 1st Qu.: 6.00
## Mode :character Median : 844 Median :12.50 Median :12.50
## Mean :1143 Mean :12.47 Mean :11.52
## 3rd Qu.:1672 3rd Qu.:16.00 3rd Qu.:16.00
## Max. :7860 Max. :34.00 Max. :34.00
## hum wind_speed weather_code is_holiday
## Min. : 20.50 Min. : 0.00 Min. : 1.000 Min. :0.00000
## 1st Qu.: 63.00 1st Qu.:10.00 1st Qu.: 1.000 1st Qu.:0.00000
## Median : 74.50 Median :15.00 Median : 2.000 Median :0.00000
## Mean : 72.32 Mean :15.91 Mean : 2.723 Mean :0.02205
## 3rd Qu.: 83.00 3rd Qu.:20.50 3rd Qu.: 3.000 3rd Qu.:0.00000
## Max. :100.00 Max. :56.50 Max. :26.000 Max. :1.00000
## is_weekend season year month
## Min. :0.0000 Min. :0.000 Min. :2015 Min. : 1.000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:2015 1st Qu.: 4.000
## Median :0.0000 Median :1.000 Median :2016 Median : 7.000
## Mean :0.2854 Mean :1.492 Mean :2016 Mean : 6.515
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:2016 3rd Qu.:10.000
## Max. :1.0000 Max. :3.000 Max. :2017 Max. :12.000
## day hour
## Min. : 1.00 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.: 6.00
## Median :16.00 Median :12.00
## Mean :15.75 Mean :11.51
## 3rd Qu.:23.00 3rd Qu.:18.00
## Max. :31.00 Max. :23.00
In order to check for the number of missing values in each column (variable), the following code is used:
# count number of missing values in each column
nm <- colSums(is.na(bike))
# convert missing count information to a table
tibble(Variable=names(nm), Number_of_Missing=as.vector(nm)) %>%
knitr::kable()
| Variable | Number_of_Missing |
|---|---|
| timestamp | 0 |
| cnt | 0 |
| t1 | 0 |
| t2 | 0 |
| hum | 0 |
| wind_speed | 0 |
| weather_code | 0 |
| is_holiday | 0 |
| is_weekend | 0 |
| season | 0 |
| year | 0 |
| month | 0 |
| day | 0 |
| hour | 0 |
non_missing_percentage <- colMeans(!is.na(bike)) * 100
non_missing_df <- data.frame(Variable = names(non_missing_percentage), Percentage = non_missing_percentage)
ggplot(data = non_missing_df, aes(x = Variable, y = Percentage)) +
geom_bar(stat = "identity", fill = "steelblue", position = position_stack(vjust = 1)) +
geom_text(aes(label = paste0(round(Percentage), "%")), vjust = 0.5) +
labs(title = "Percentage of Non-Missing Values", x = "Variable", y = "Percentage") +
coord_flip()
# Define categorical variables
categorical_vars <- c("weather_code", "is_holiday", "is_weekend", "season", "year", "month", "day", "hour")
# Create a tibble for categorical variables
categorical_table <- tibble(
Categorical_Variables = categorical_vars
)
# Print the categorical table
print(categorical_table)
## # A tibble: 8 × 1
## Categorical_Variables
## <chr>
## 1 weather_code
## 2 is_holiday
## 3 is_weekend
## 4 season
## 5 year
## 6 month
## 7 day
## 8 hour
# Define numerical variables
numerical_vars <- c("cnt", "t1", "t2", "hum", "wind_speed")
# Create a tibble for numerical variables
numerical_table <- tibble(
Numerical_Variables = numerical_vars
)
# Print the numerical table
print(numerical_table)
## # A tibble: 5 × 1
## Numerical_Variables
## <chr>
## 1 cnt
## 2 t1
## 3 t2
## 4 hum
## 5 wind_speed
# Get the structure of the dataset
data_info <- str(bike)
## 'data.frame': 17414 obs. of 14 variables:
## $ timestamp : chr "2015-01-04 00:00:00" "2015-01-04 01:00:00" "2015-01-04 02:00:00" "2015-01-04 03:00:00" ...
## $ cnt : int 182 138 134 72 47 46 51 75 131 301 ...
## $ t1 : num 3 3 2.5 2 2 2 1 1 1.5 2 ...
## $ t2 : num 2 2.5 2.5 2 0 2 -1 -1 -1 -0.5 ...
## $ hum : num 93 93 96.5 100 93 93 100 100 96.5 100 ...
## $ wind_speed : num 6 5 0 0 6.5 4 7 7 8 9 ...
## $ weather_code: num 3 1 1 1 1 1 4 4 4 3 ...
## $ is_holiday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 1 1 1 1 1 1 1 1 1 1 ...
## $ season : num 3 3 3 3 3 3 3 3 3 3 ...
## $ year : num 2015 2015 2015 2015 2015 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 4 4 4 4 4 4 4 4 4 4 ...
## $ hour : int 0 1 2 3 4 5 6 7 8 9 ...
# Extract the variable types
variable_types <- data_info$info$type
# Create a data frame with variable names and types
variable_table <- data.frame(Variable = names(variable_types), Type = variable_types, stringsAsFactors = FALSE)
# Divide variables into categorical and numerical
categorical_vars <- variable_table$Variable[variable_table$Type == "character"]
numerical_vars <- variable_table$Variable[variable_table$Type != "character"]
# Create tables for categorical and numerical variables
categorical_table <- data.frame(Categorical_Variables = categorical_vars)
numerical_table <- data.frame(Numerical_Variables = numerical_vars)
# Print the tables
print(categorical_table)
## data frame with 0 columns and 0 rows
print(numerical_table)
## data frame with 0 columns and 0 rows
# count number of missing values in each column
nm <- colSums(is.na(bike))
# convert missing count information to a table
tibble(Variable=names(nm), Number_of_Missing=as.vector(nm)) %>%
knitr::kable()
| Variable | Number_of_Missing |
|---|---|
| timestamp | 0 |
| cnt | 0 |
| t1 | 0 |
| t2 | 0 |
| hum | 0 |
| wind_speed | 0 |
| weather_code | 0 |
| is_holiday | 0 |
| is_weekend | 0 |
| season | 0 |
| year | 0 |
| month | 0 |
| day | 0 |
| hour | 0 |
In the following codes, a suitable variable type is assigned to each variable based on the description provided in the previous sections.
# replace the values more than 4 in weather_code with 4
bike$weather_code[bike$weather_code > 4] <- 4
# variable type setting
# bike$timestamp <- as.Date(bike$timestamp)
bike$weather_code <- as.factor(bike$weather_code)
bike$is_holiday <- as.factor(bike$is_holiday)
bike$is_weekend <- as.factor(bike$is_weekend)
bike$season <- as.factor(bike$season)
bike$year <- as.factor(bike$year)
bike$month <- as.factor(bike$month)
bike$day <- as.factor(bike$day)
bike$hour <- as.factor(bike$hour)
# print the first 10 rows of the data
head(bike, n=10)
## timestamp cnt t1 t2 hum wind_speed weather_code is_holiday
## 1 2015-01-04 00:00:00 182 3.0 2.0 93.0 6.0 3 0
## 2 2015-01-04 01:00:00 138 3.0 2.5 93.0 5.0 1 0
## 3 2015-01-04 02:00:00 134 2.5 2.5 96.5 0.0 1 0
## 4 2015-01-04 03:00:00 72 2.0 2.0 100.0 0.0 1 0
## 5 2015-01-04 04:00:00 47 2.0 0.0 93.0 6.5 1 0
## 6 2015-01-04 05:00:00 46 2.0 2.0 93.0 4.0 1 0
## 7 2015-01-04 06:00:00 51 1.0 -1.0 100.0 7.0 4 0
## 8 2015-01-04 07:00:00 75 1.0 -1.0 100.0 7.0 4 0
## 9 2015-01-04 08:00:00 131 1.5 -1.0 96.5 8.0 4 0
## 10 2015-01-04 09:00:00 301 2.0 -0.5 100.0 9.0 3 0
## is_weekend season year month day hour
## 1 1 3 2015 1 4 0
## 2 1 3 2015 1 4 1
## 3 1 3 2015 1 4 2
## 4 1 3 2015 1 4 3
## 5 1 3 2015 1 4 4
## 6 1 3 2015 1 4 5
## 7 1 3 2015 1 4 6
## 8 1 3 2015 1 4 7
## 9 1 3 2015 1 4 8
## 10 1 3 2015 1 4 9
# print the last 10 rows of the data
tail(bike, n=10)
## timestamp cnt t1 t2 hum wind_speed weather_code is_holiday
## 17405 2017-01-03 14:00:00 765 6.0 2.0 73.5 22 3 0
## 17406 2017-01-03 15:00:00 845 6.0 2.0 71.0 27 4 0
## 17407 2017-01-03 16:00:00 1201 6.0 2.0 71.0 26 4 0
## 17408 2017-01-03 17:00:00 2742 6.0 2.0 73.5 21 3 0
## 17409 2017-01-03 18:00:00 2220 5.0 1.0 81.0 22 2 0
## 17410 2017-01-03 19:00:00 1042 5.0 1.0 81.0 19 3 0
## 17411 2017-01-03 20:00:00 541 5.0 1.0 81.0 21 4 0
## 17412 2017-01-03 21:00:00 337 5.5 1.5 78.5 24 4 0
## 17413 2017-01-03 22:00:00 224 5.5 1.5 76.0 23 4 0
## 17414 2017-01-03 23:00:00 139 5.0 1.0 76.0 22 2 0
## is_weekend season year month day hour
## 17405 0 3 2017 1 3 14
## 17406 0 3 2017 1 3 15
## 17407 0 3 2017 1 3 16
## 17408 0 3 2017 1 3 17
## 17409 0 3 2017 1 3 18
## 17410 0 3 2017 1 3 19
## 17411 0 3 2017 1 3 20
## 17412 0 3 2017 1 3 21
## 17413 0 3 2017 1 3 22
## 17414 0 3 2017 1 3 23
# data summarization
skim(bike)
| Name | bike |
| Number of rows | 17414 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| timestamp | 0 | 1 | 19 | 19 | 0 | 17414 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| weather_code | 0 | 1 | FALSE | 4 | 1: 6150, 2: 4034, 4: 3679, 3: 3551 |
| is_holiday | 0 | 1 | FALSE | 2 | 0: 17030, 1: 384 |
| is_weekend | 0 | 1 | FALSE | 2 | 0: 12444, 1: 4970 |
| season | 0 | 1 | FALSE | 4 | 0: 4394, 1: 4387, 3: 4330, 2: 4303 |
| year | 0 | 1 | FALSE | 3 | 201: 8699, 201: 8643, 201: 72 |
| month | 0 | 1 | FALSE | 12 | 5: 1488, 1: 1487, 8: 1484, 12: 1484 |
| day | 0 | 1 | FALSE | 31 | 6: 576, 14: 576, 21: 576, 22: 576 |
| hour | 0 | 1 | FALSE | 24 | 16: 730, 12: 729, 15: 729, 13: 728 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| cnt | 0 | 1 | 1143.10 | 1085.11 | 0.0 | 257 | 844.0 | 1671.75 | 7860.0 | ▇▂▁▁▁ |
| t1 | 0 | 1 | 12.47 | 5.57 | -1.5 | 8 | 12.5 | 16.00 | 34.0 | ▂▇▇▂▁ |
| t2 | 0 | 1 | 11.52 | 6.62 | -6.0 | 6 | 12.5 | 16.00 | 34.0 | ▂▅▇▂▁ |
| hum | 0 | 1 | 72.32 | 14.31 | 20.5 | 63 | 74.5 | 83.00 | 100.0 | ▁▂▅▇▅ |
| wind_speed | 0 | 1 | 15.91 | 7.89 | 0.0 | 10 | 15.0 | 20.50 | 56.5 | ▆▇▃▁▁ |
The above results show that the London bike sharing data is summarised based on the variable type. In the following subsection, the data are visualized with more details.
# histogram of cnt
ggplot(data=bike, aes(x=cnt))+
geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
scale_x_continuous(n.breaks = 10)+
scale_y_continuous(n.breaks = 10)+
geom_vline(xintercept = mean(bike$cnt), colour='green', linetype='dashed', linewidth=2)+
geom_vline(xintercept = median(bike$cnt), colour='red', linetype='dotted', linewidth=2)+
labs(x='total number of bike shares in London',
title = 'The histogram of total number of bike shares in London',
subtitle = 'The red dashed line=mean(cnt) and the green dotted line=median(cnt)')
# boxplot of cnt
ggplot(data=bike, aes(x=cnt))+
geom_boxplot(fill='cadetblue4')+
scale_x_continuous(n.breaks = 20)+
scale_y_discrete()+
geom_vline(xintercept = mean(bike$cnt), colour='green', linetype='dashed', linewidth=2)+
geom_vline(xintercept = median(bike$cnt), colour='red', linetype='dotted', linewidth=2)+
labs(x='total number of bike shares in London',
title = 'The boxplot of total number of bike shares in London',
subtitle = 'The red dashed line=mean(cnt) and the green dotted line=median(cnt)')
# histogram of t1
ggplot(data=bike, aes(x=t1))+
geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
scale_x_continuous(n.breaks = 10)+
scale_y_continuous(n.breaks = 10)+
geom_vline(xintercept = mean(bike$t1), colour='green', linetype='dashed', linewidth=2)+
geom_vline(xintercept = median(bike$t1), colour='red', linetype='dotted', linewidth=2)+
labs(x='the temperature in celsius',
title = 'The histogram of the temperature in celsius (t1)',
subtitle = 'The red dashed line=mean(t1) and the green dotted line=median(t1)')
# histogram of t2
ggplot(data=bike, aes(x=t2))+
geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
scale_x_continuous(n.breaks = 10)+
scale_y_continuous(n.breaks = 10)+
geom_vline(xintercept = mean(bike$t2), colour='green', linetype='dashed', linewidth=2)+
geom_vline(xintercept = median(bike$t2), colour='red', linetype='dotted', linewidth=2)+
labs(x='the apparent (feels-like) temperature in celsius',
title = 'The histogram of the apparent (feels-like) temperature in celsius (t2)',
subtitle = 'The red dashed line=mean(t2) and the green dotted line=median(t2)')
# histogram of hum
ggplot(data=bike, aes(x=hum))+
geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
scale_x_continuous(n.breaks = 10)+
scale_y_continuous(n.breaks = 10)+
geom_vline(xintercept = mean(bike$hum), colour='green', linetype='dashed', linewidth=2)+
geom_vline(xintercept = median(bike$hum), colour='red', linetype='dotted', linewidth=2)+
labs(x='the humidity level',
title = 'The histogram of the humidity level (hum)',
subtitle = 'The red dashed line=mean(hum) and the green dotted line=median(hum)')
# histogram of wind_speed
ggplot(data=bike, aes(x=wind_speed))+
geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
scale_x_continuous(n.breaks = 10)+
scale_y_continuous(n.breaks = 10)+
geom_vline(xintercept = mean(bike$wind_speed), colour='green', linetype='dashed', linewidth=2)+
geom_vline(xintercept = median(bike$wind_speed), colour='red', linetype='dotted', linewidth=2)+
labs(x='the wind speed',
title = 'The histogram of the wind speed (wind_speed)',
subtitle = 'The red dashed line=mean(wind_speed) and the green dotted line=median(wind_speed)')
Overall, The numeric features (independent variables) seem to be more normally distributed compared to the response variable because the mean and median of the variables are nearer the middle of the range of values except humidity where mean and median is towards right side, coinciding with where the most commonly occurring values are. Now that the distribution of the numerical variables are investigated, it is time to perform a correlation analysis to investigate the relationship between variables.
# QQ plot for cnt
qqnorm(bike$cnt, main = "QQ Plot: Bike Count (cnt)")
qqline(bike$cnt)
# QQ plot for t1
qqnorm(bike$t1, main = "QQ Plot: Temperature (t1)")
qqline(bike$t1)
# QQ plot for hum
qqnorm(bike$hum, main = "QQ Plot: Humidity (hum)")
qqline(bike$hum)
# QQ plot for wind_speed
qqnorm(bike$wind_speed, main = "QQ Plot: Wind Speed")
qqline(bike$wind_speed)
# Pie chart for is_holiday
ggplot(data = bike, aes(x = "", fill = is_holiday)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
xlab("") +
ylab("") +
guides(fill = guide_legend(title = "is_holiday")) +
scale_fill_manual(values = c("cadetblue4", "cyan3"), labels = c("NO", "YES")) +
labs(title = "Pie Chart - is_holiday")
# Pie chart for weather_code
ggplot(data = bike, aes(x = "", fill = weather_code)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
xlab("") +
ylab("") +
guides(fill = guide_legend(title = "weather_code")) +
scale_fill_manual(values = c("cadetblue4", "cyan3", "darkgoldenrod2", "coral2"),
labels = c('Clear','Mist/Cloud','Light Rain/Snow','Heavy Rain/Hail/Snow/Fog')) +
labs(title = "Pie Chart - weather_code")
# Pie chart for is_weekend
ggplot(data = bike, aes(x = "", fill = is_weekend)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
xlab("") +
ylab("") +
guides(fill = guide_legend(title = "is_weekend")) +
scale_fill_manual(values = c("cadetblue4", "cyan3"), labels = c("NO", "YES")) +
labs(title = "Pie Chart - is_weekend")
# Pie chart for season
ggplot(data = bike, aes(x = "", fill = season)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
xlab("") +
ylab("") +
guides(fill = guide_legend(title = "season")) +
scale_fill_manual(values = c("cadetblue4", "cyan3", "darkgoldenrod2", "coral2"),labels=c('Spring','Summer','Fall','Winter')) +
labs(title = "Pie Chart - season")
Now we use box-plots in order to compare each categorical independent variable with the response variable (the number of bike shares in London - cnt):
# boxplot of cnt and is_weekend
ggplot(data=bike, aes(x=is_weekend, y=cnt, fill=is_weekend))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
scale_x_discrete(breaks=c(0,1), labels=c('No','Yes'))+
theme_bw(base_size = 15)+
labs(x='whether or not the day is a weekend',y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for weekend indicator')
The median for weekends is lower compared to non-weekends, indicating that the median number of bike rentals is generally lower on weekends. The box for weekends is smaller than the box for non-weekends, suggesting less variability in bike rentals on weekends.
# boxplot of cnt and is_holiday
ggplot(data=bike, aes(x=is_holiday, y=cnt, fill=is_holiday))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
scale_x_discrete(breaks=c(0,1), labels=c('No','Yes'))+
labs(x='whether or not the day is a holiday',y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for holiday indicator')
# boxplot of cnt and season
ggplot(data=bike, aes(x=season, y=cnt, fill=season))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
scale_x_discrete(breaks=c(0,1,2,3), labels=c('Spring','Summer','Fall','Winter'))+
labs(x='season', y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for different seasons')
# boxplot of cnt and weather_code
ggplot(data=bike, aes(x=weather_code, y=cnt, fill=weather_code))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
scale_x_discrete(breaks=c(1,2,3,4), labels=c('Clear','Mist/Cloud','Light Rain/Snow','Heavy Rain/Hail/Snow/Fog'))+
labs(x='the weather situation', y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for different weather situations')
# boxplot of cnt and year
ggplot(data=bike, aes(x=year, y=cnt, fill=year))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
labs(x='year', y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for different years')
# boxplot of cnt and month
ggplot(data=bike, aes(x=month, y=cnt, fill=month))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
labs(x='month', y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for different months')
# boxplot of cnt and day
ggplot(data=bike, aes(x=day, y=cnt, fill=day))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
labs(x='day', y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for different days')
# boxplot of cnt and hour
ggplot(data=bike, aes(x=hour, y=cnt, fill=hour))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
labs(x='hour', y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for different hours')
# create another categorical variable to discretion the hours
h <- bike$hour %>% as.numeric()
hour_cat <- ifelse(h > 6 & h < 10, 'between_6_and_10',
ifelse(h > 10 & h < 17, 'between_10_and_17',
ifelse(h > 17 & h < 20,'between_17_and_20',
'between_20_and_6')))
# add the hour_cat to the data and replace it by hour
bike <- bike %>% mutate(hour=as.factor(hour_cat))
# boxplot of cnt and hour categories
ggplot(data=bike, aes(x=hour, y=cnt, fill=hour))+
geom_boxplot(show.legend = F)+
scale_y_continuous(n.breaks = 10)+
labs(x='hour intervals', y="the total number of bike shares in London",
title = 'The boxplot of total number of bike shares in London for different hour intervals')
numerical_vars <- bike %>% select(cnt:wind_speed) %>% mutate(cnt=as.numeric(cnt))
chart.Correlation(numerical_vars, histogram = F, method = "pearson")
The numeric features (independent variables) seem to be more normally distributed compared to the response variable because the mean and median of the variables are nearer the middle of the range of values except humidity where the mean and median are towards the right side, coinciding with where the most commonly occurring values are. Now that the distribution of the numerical variables is investigated, it is time to perform a correlation analysis to investigate the relationship between variables.
# Calculate correlation matrix
cor_matrix <- cor(numerical_vars)
# Convert correlation matrix to long format
cor_data <- reshape2::melt(cor_matrix)
# Plot heatmap with correlation values using dark mode colors
ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "black") +
scale_fill_gradient2(low = "white", mid = "red", high = "black",
midpoint = 0, limits = c(-1, 1), name = "Correlation") +
geom_text(aes(label = round(value, 2)), size = 6, color = "white") +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.background = element_rect(fill = "white"),
plot.title = element_text(color = "green"),
axis.text = element_text(color = "black"),
legend.title = element_text(color = "black"),
legend.text = element_text(color = "black"))
The variable “cnt” (total number of bike shares) has a positive correlation of 0.39 with “t1” (temperature), indicating a weak positive relationship. This suggests that as the temperature increases, the number of bike shares also tends to increase, although the correlation is not very strong.
The variable “cnt” has a negative correlation of -0.46 with “hum” (humidity), indicating a moderate negative relationship. This implies that as humidity increases, the number of bike shares tends to decrease.
The variable “cnt” has a weak positive correlation of 0.12 with “wind_speed”, suggesting a weak positive relationship. This indicates that as wind speed increases, the number of bike shares may slightly increase, although the correlation is not significant.
All the variables can be useful in this analysis except
timestamp and t2 so it is excluded from
further analyses.
Let’s take a loot to the both features on a line Graph first:
# Convert timestamp to POSIXct format
bike$timestamp <- as.POSIXct(bike$timestamp)
# Create a line graph for the whole dataset
ggplot(bike, aes(x = timestamp)) +
geom_line(aes(y = t1, color = "t1"), linetype = "solid") +
geom_line(aes(y = t2, color = "t2"), linetype = "solid") +
xlab("Timestamp") +
ylab("Temperature (Celsius)") +
scale_color_grey() +
labs(color = "Temperature Feature") +
theme_classic()
# Filter data for 2016
bike_2016 <- filter(bike, year == 2016)
# Create a line graph for 2016
ggplot(bike_2016, aes(x = timestamp)) +
geom_line(aes(y = t1, color = "t1"), linetype = "solid") +
geom_line(aes(y = t2, color = "t2"), linetype = "solid") +
xlab("Timestamp") +
ylab("Temperature (Celsius)") +
scale_color_grey() +
labs(color = "Temperature Feature") +
theme_classic()
# Filter data for Jan - Mar 2016
bike_jan_mar_2016 <- filter(bike, year == 2016, month %in% c(1, 2, 3))
# Create a line graph for Jan - Mar 2016
ggplot(bike_jan_mar_2016, aes(x = timestamp)) +
geom_line(aes(y = t1, color = "t1"), linetype = "solid") +
geom_line(aes(y = t2, color = "t2"), linetype = "solid") +
xlab("Timestamp") +
ylab("Temperature (Celsius)") +
scale_color_grey() +
labs(color = "Temperature Feature") +
theme_classic()
From the provided line graph, we can observe the following:
Temperature Variation: The graph shows the variation of temperature over time. The solid black line represents the temperature (t1) and the gray line represents the apparent temperature (t2).
Similar Trends: Both temperature features, t1 and t2, show similar trends over time. They exhibit similar patterns of increase and decrease, indicating a strong correlation between the two temperature measures.
Magnitude Differences: However, there is a noticeable difference in the magnitude between t1 and t2. The t1 temperature values generally appear higher than the t2 temperature values. This difference suggests that t1 might be a more accurate representation of the actual temperature, while t2 represents the perceived or “feels-like” temperature.
Seasonal Patterns: The graph also reveals seasonal patterns in temperature. There are peaks and troughs that occur at regular intervals, suggesting temperature fluctuations corresponding to different seasons.
Overall, the graph provides insights into the temperature variations over time, highlighting the similarities and differences between the t1 and t2 temperature measures. It can be useful for understanding temperature patterns, identifying trends, and making comparisons between different temperature measurements.
# exclude t2
bike <- bike %>% select(-t2)
Our dataset has a significantly lower number of data points for the year 2017 compared to the other years (2016 and 2015), we excluded the year 2017 from our analysis. Here are a few factors that we considered when we made this decision:
Data imbalance: Removing the year 2017 can help mitigate this issue and provide a more balanced dataset. Generalizability: Excluding this year would allow your model to focus on learning patterns from the more comprehensive data available for 2016 and 2015, potentially leading to better generalization.
# Count the number of data points for each year
year_counts <- table(bike$year)
# Print the counts
print(year_counts)
##
## 2015 2016 2017
## 8643 8699 72
# Create a bar plot of the counts
barplot(year_counts, main = "Number of Data Points by Year", xlab = "Year", ylab = "Count")
# Count the number of unique days per year
days_per_year <- bike %>%
group_by(year) %>%
summarise(number_of_days = n_distinct(date(timestamp)))
# Print the number of days per year
print(days_per_year)
## # A tibble: 3 × 2
## year number_of_days
## <fct> <int>
## 1 2015 362
## 2 2016 365
## 3 2017 3
# Filter data to exclude 2017
bike <- bike %>% filter(year != 2017)
# data summarization
skim(bike)
| Name | bike |
| Number of rows | 17342 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 4 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| weather_code | 0 | 1 | FALSE | 4 | 1: 6120, 2: 4027, 4: 3657, 3: 3538 |
| is_holiday | 0 | 1 | FALSE | 2 | 0: 16982, 1: 360 |
| is_weekend | 0 | 1 | FALSE | 2 | 0: 12396, 1: 4946 |
| season | 0 | 1 | FALSE | 4 | 0: 4394, 1: 4387, 2: 4303, 3: 4258 |
| year | 0 | 1 | FALSE | 2 | 201: 8699, 201: 8643, 201: 0 |
| month | 0 | 1 | FALSE | 12 | 5: 1488, 8: 1484, 12: 1484, 7: 1481 |
| day | 0 | 1 | FALSE | 31 | 6: 576, 14: 576, 21: 576, 22: 576 |
| hour | 0 | 1 | FALSE | 4 | bet: 9377, bet: 4348, bet: 2167, bet: 1450 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| cnt | 0 | 1 | 1145.67 | 1085.92 | 0.0 | 260.0 | 846.0 | 1676.0 | 7860.0 | ▇▂▁▁▁ |
| t1 | 0 | 1 | 12.50 | 5.56 | -1.5 | 8.5 | 12.5 | 16.0 | 34.0 | ▂▇▇▂▁ |
| hum | 0 | 1 | 72.28 | 14.32 | 20.5 | 63.0 | 74.5 | 83.0 | 100.0 | ▁▂▅▇▅ |
| wind_speed | 0 | 1 | 15.92 | 7.90 | 0.0 | 10.0 | 15.0 | 20.5 | 56.5 | ▆▇▃▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| timestamp | 0 | 1 | 2015-01-04 | 2016-12-31 23:00:00 | 2016-01-02 03:30:00 | 17342 |
# exclude timestamp
bike <- bike %>% select(-timestamp)
# Check features again
nm <- colSums(is.na(bike))
# convert missing count information to a table
tibble(Variable=names(nm), Number_of_Missing=as.vector(nm)) %>%
knitr::kable()
| Variable | Number_of_Missing |
|---|---|
| cnt | 0 |
| t1 | 0 |
| hum | 0 |
| wind_speed | 0 |
| weather_code | 0 |
| is_holiday | 0 |
| is_weekend | 0 |
| season | 0 |
| year | 0 |
| month | 0 |
| day | 0 |
| hour | 0 |
In this section, we are going to train a regression model to the data in order to predict the number of bike shares in London using independent variables. First we need to divide the data to training and test data sets as below:
set.seed(123)
train_index <- createDataPartition(bike$cnt, p = 0.8, list = FALSE)
train_data <- bike[train_index, ]
test_data <- bike[-train_index, ]
Now we use feature scaling and scale the numerical variables as below:
numerical_train <- train_data %>% select(cnt:wind_speed)
numerical_test <- test_data %>% select(cnt:wind_speed)
bike_proc_tr <- preProcess(numerical_train, method = c("center", "scale"))
bike_proc_te <- preProcess(numerical_test, method = c("center", "scale"))
bike_scaled_tr <- predict(bike_proc_tr, numerical_train)
bike_scaled_te <- predict(bike_proc_te, numerical_test)
The scaled data are combined with the categorical variables as below:
train_cat <- train_data %>% select_if(is.factor)
test_cat <- test_data %>% select_if(is.factor)
train_final <- cbind.data.frame(bike_scaled_tr, train_cat)
test_final <- cbind.data.frame(bike_scaled_te, test_cat)
Now a linear model is fitted to the training data:
# Fit linear model
fit1 <- lm(cnt ~ ., data = train_final)
# Predict the values for the test data
pred1 <- predict(fit1, newdata = test_final)
# Create a data frame for observed and predicted values
df1 <- data.frame(Observed = test_final$cnt, Predicted = pred1)
# Create the scatter plot with regression line
ggplot(data = df1, aes(x = Observed, y = Predicted)) +
geom_point(col = 'cadetblue4') +
geom_smooth(method = 'lm', col = 'red', se = FALSE) + # Add the regression line
scale_x_continuous(n.breaks = 10) +
scale_y_continuous(n.breaks = 10) +
labs(title = 'The scatter plot of bike shares predictions using linear regression model')
There’s a definite diagonal trend, and the intersections of the predicted and actual values are generally following the path of the trend line; but there’s a fair amount of difference between the ideal function represented by the line and the results. This variance represents the residuals of the model - in other words, the difference between the label predicted when the model applies the coefficients it learned during training to the validation data, and the actual value of the validation label. We can quantify the residuals by calculating a number of commonly used evaluation metrics.
Therefore, the above mentioned criteria are calculated for the model evaluation:
mse <- mean((df1$Observed-df1$Predicted)^2)
cat(paste('The Mean Squared Error is', mse), sep = '\n')
## The Mean Squared Error is 0.464380947836642
rmse <- sqrt(mse)
cat(paste('The Root Mean Squared Error is', rmse), sep = '\n')
## The Root Mean Squared Error is 0.681455022607246
r_squared <- summary(fit1)$r.squared
cat(paste('The Coefficient of Determination is', r_squared), sep = '\n')
## The Coefficient of Determination is 0.511898921541363
The model has some predictive power, but the random forest model might perform better.
Decision tree modeling is a popular machine learning algorithm used for classification and regression analysis. It is a graphical representation of all possible outcomes of a decision based on certain conditions or variables. The decision tree consists of nodes, branches, and leaves. The nodes represent the conditions or variables, the branches represent the possible outcomes, and the leaves represent the final decision or prediction.
The decision tree algorithm works by recursively splitting the data into smaller subsets based on the most significant variable that can best separate the data into different classes. This process continues until all data points are classified correctly or until a stopping criterion is met.
# fit decision tree model
fit2 <- rpart(cnt~., data=train_final, method = "anova")
library(rpart.plot)
# Plot the decision tree
rpart.plot(fit2, main = "Decision Tree ")
# predict the one the test data
pred2 <- predict(fit2, newdata = test_final, method = "anova")
# plot the predicted vs observed in the test set
df2 <- data.frame(Observed=test_final$cnt, Predicted=as.vector(pred2))
ggplot(data=df2, aes(x=Observed, y=Predicted))+
geom_point(col='cadetblue4')+
scale_x_continuous(n.breaks = 10)+
scale_y_continuous(n.breaks = 10)+
labs(title = 'The scatter plot of bike shares predictions using decision tree model')
The scatter plot shows a fairly linear pattern, indicating that the model’s predictions are relatively close to the actual bike share values. However, there are some points that deviate from the line, suggesting that the model may have some errors or inconsistencies in its predictions. Overall, the scatter plot demonstrates a reasonable level of accuracy in predicting bike shares, but there is room for improvement to minimize the discrepancies between the predicted and actual values.
Again, the same criteria which were calculated for the linear regression model, are evaluated for the decision tree model in order to compare the performance of the two models:
mse2 <- mean((df2$Observed-df2$Predicted)^2)
cat(paste('The Mean Squared Error is', mse2), sep='\n')
## The Mean Squared Error is 0.413126563278651
rmse2 <- sqrt(mse2)
cat(paste('The Root Mean Squared Error is', rmse2), sep='\n')
## The Root Mean Squared Error is 0.64274922269782
# Calculate R-squared
ssr2 <- sum((df2$Predicted - mean(df2$Observed))^2)
sst2 <- sum((df2$Observed - mean(df2$Observed))^2)
r_squared2 <- 1 - (ssr2 / sst2)
cat(paste('The Coefficient of Determination (R-squared) is', r_squared2), sep='\n')
## The Coefficient of Determination (R-squared) is 0.397921861651633
set.seed(123)
target_variable <- "cnt"
# Split the data into input features (X) and target variable (y)
X_train <- train_final %>% select(-target_variable)
y_train <- train_final[[target_variable]]
X_test <- test_final %>% select(-target_variable)
y_test <- test_final[[target_variable]]
# Convert factor variables to numeric
X_train <- X_train %>%
mutate_if(is.factor, as.numeric)
X_test <- X_test %>%
mutate_if(is.factor, as.numeric)
# Convert the data to the xgboost format
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = y_test)
# Define the parameters for the XGBoost model
params <- list(
objective = "reg:squarederror", # Regression objective function
eval_metric = "rmse", # Evaluation metric: Root Mean Squared Error
nrounds = 100, # Number of boosting rounds
early_stopping_rounds = 10, # Early stopping rounds
verbose = FALSE # Print log messages or not
)
# Train the XGBoost model
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100, watchlist = list(train = dtrain, test = dtest))
## [10:54:08] WARNING: src/learner.cc:767:
## Parameters: { "early_stopping_rounds", "nrounds", "verbose" } are not used.
##
## [1] train-rmse:0.897027 test-rmse:0.891035
## [2] train-rmse:0.756121 test-rmse:0.744858
## [3] train-rmse:0.673883 test-rmse:0.663039
## [4] train-rmse:0.622357 test-rmse:0.614597
## [5] train-rmse:0.591860 test-rmse:0.585529
## [6] train-rmse:0.574501 test-rmse:0.571201
## [7] train-rmse:0.561882 test-rmse:0.561953
## [8] train-rmse:0.551891 test-rmse:0.554967
## [9] train-rmse:0.543468 test-rmse:0.549252
## [10] train-rmse:0.537497 test-rmse:0.544000
## [11] train-rmse:0.533137 test-rmse:0.540871
## [12] train-rmse:0.529504 test-rmse:0.539054
## [13] train-rmse:0.523357 test-rmse:0.534668
## [14] train-rmse:0.519434 test-rmse:0.531948
## [15] train-rmse:0.515827 test-rmse:0.531144
## [16] train-rmse:0.512211 test-rmse:0.528391
## [17] train-rmse:0.508699 test-rmse:0.526213
## [18] train-rmse:0.503330 test-rmse:0.526335
## [19] train-rmse:0.501280 test-rmse:0.525991
## [20] train-rmse:0.496877 test-rmse:0.525385
## [21] train-rmse:0.494660 test-rmse:0.523912
## [22] train-rmse:0.492708 test-rmse:0.524263
## [23] train-rmse:0.491682 test-rmse:0.523317
## [24] train-rmse:0.487720 test-rmse:0.521902
## [25] train-rmse:0.485089 test-rmse:0.521223
## [26] train-rmse:0.483168 test-rmse:0.521004
## [27] train-rmse:0.481945 test-rmse:0.520440
## [28] train-rmse:0.479248 test-rmse:0.520937
## [29] train-rmse:0.476716 test-rmse:0.521916
## [30] train-rmse:0.475314 test-rmse:0.522120
## [31] train-rmse:0.474000 test-rmse:0.522068
## [32] train-rmse:0.470682 test-rmse:0.520938
## [33] train-rmse:0.469324 test-rmse:0.520651
## [34] train-rmse:0.467817 test-rmse:0.520100
## [35] train-rmse:0.467408 test-rmse:0.520229
## [36] train-rmse:0.467073 test-rmse:0.519929
## [37] train-rmse:0.464584 test-rmse:0.520645
## [38] train-rmse:0.463744 test-rmse:0.520933
## [39] train-rmse:0.461735 test-rmse:0.521289
## [40] train-rmse:0.459794 test-rmse:0.520643
## [41] train-rmse:0.459070 test-rmse:0.520309
## [42] train-rmse:0.457798 test-rmse:0.520234
## [43] train-rmse:0.456392 test-rmse:0.520037
## [44] train-rmse:0.454147 test-rmse:0.519519
## [45] train-rmse:0.453785 test-rmse:0.519326
## [46] train-rmse:0.452839 test-rmse:0.519608
## [47] train-rmse:0.452566 test-rmse:0.519617
## [48] train-rmse:0.451560 test-rmse:0.519361
## [49] train-rmse:0.450378 test-rmse:0.519297
## [50] train-rmse:0.448348 test-rmse:0.519114
## [51] train-rmse:0.447490 test-rmse:0.519134
## [52] train-rmse:0.446692 test-rmse:0.519956
## [53] train-rmse:0.445891 test-rmse:0.519427
## [54] train-rmse:0.443463 test-rmse:0.518809
## [55] train-rmse:0.441752 test-rmse:0.518524
## [56] train-rmse:0.440224 test-rmse:0.518557
## [57] train-rmse:0.439828 test-rmse:0.518510
## [58] train-rmse:0.439483 test-rmse:0.518447
## [59] train-rmse:0.438998 test-rmse:0.518455
## [60] train-rmse:0.438799 test-rmse:0.518474
## [61] train-rmse:0.437833 test-rmse:0.518759
## [62] train-rmse:0.435421 test-rmse:0.518799
## [63] train-rmse:0.433532 test-rmse:0.518648
## [64] train-rmse:0.432876 test-rmse:0.519057
## [65] train-rmse:0.430204 test-rmse:0.519137
## [66] train-rmse:0.428878 test-rmse:0.518653
## [67] train-rmse:0.427847 test-rmse:0.519388
## [68] train-rmse:0.426643 test-rmse:0.519559
## [69] train-rmse:0.425878 test-rmse:0.519793
## [70] train-rmse:0.424928 test-rmse:0.519536
## [71] train-rmse:0.424003 test-rmse:0.519524
## [72] train-rmse:0.422671 test-rmse:0.519563
## [73] train-rmse:0.420626 test-rmse:0.519847
## [74] train-rmse:0.419192 test-rmse:0.519965
## [75] train-rmse:0.418654 test-rmse:0.520061
## [76] train-rmse:0.417579 test-rmse:0.520812
## [77] train-rmse:0.416525 test-rmse:0.520674
## [78] train-rmse:0.415101 test-rmse:0.520903
## [79] train-rmse:0.413527 test-rmse:0.520898
## [80] train-rmse:0.411983 test-rmse:0.520631
## [81] train-rmse:0.411411 test-rmse:0.520728
## [82] train-rmse:0.410229 test-rmse:0.521120
## [83] train-rmse:0.409255 test-rmse:0.521298
## [84] train-rmse:0.408776 test-rmse:0.521384
## [85] train-rmse:0.408161 test-rmse:0.521402
## [86] train-rmse:0.407570 test-rmse:0.521492
## [87] train-rmse:0.407352 test-rmse:0.521392
## [88] train-rmse:0.405752 test-rmse:0.521559
## [89] train-rmse:0.404376 test-rmse:0.521863
## [90] train-rmse:0.403122 test-rmse:0.521659
## [91] train-rmse:0.401176 test-rmse:0.521263
## [92] train-rmse:0.400555 test-rmse:0.521104
## [93] train-rmse:0.399100 test-rmse:0.521280
## [94] train-rmse:0.398417 test-rmse:0.521640
## [95] train-rmse:0.397836 test-rmse:0.521738
## [96] train-rmse:0.396902 test-rmse:0.521686
## [97] train-rmse:0.396463 test-rmse:0.521667
## [98] train-rmse:0.395456 test-rmse:0.521585
## [99] train-rmse:0.394861 test-rmse:0.521927
## [100] train-rmse:0.393967 test-rmse:0.521700
# Make predictions on the test set
predictions <- predict(xgb_model, newdata = dtest)
Based on the output, we can see that as the model iterates, the train RMSE gradually decreases from 0.897027 to 0.393967, indicating that the model is improving its fit to the training data. Similarly, the test RMSE decreases from 0.891035 to 0.521700, suggesting that the model is also generalizing well to unseen data. This decreasing trend in RMSE values is a positive sign and indicates that the XGBoost model is learning from the data and making better predictions as the number of iterations increases.
# Evaluate the model
mse4 <- mean((predictions - y_test)^2)
rmse4 <- sqrt(mse4)
r_squared4 <- 1 - mse4 / var(y_test)
# Print the evaluation metrics
print(paste("MSE:", mse4))
## [1] "MSE: 0.272171180123888"
print(paste("RMSE:", rmse4))
## [1] "RMSE: 0.521700278056173"
print(paste("R-squared:", r_squared4))
## [1] "R-squared: 0.727828819876112"
Based on the evaluation metrics, it is evident that both the XGBoost and Random Forest models have performed well in predicting the target variable (cnt). The XGBoost model achieved a lower MSE of 0.2721712 and RMSE of 0.5217003, indicating that it produced more accurate predictions compared to the Random Forest model, which had an MSE of 0.2996687 and RMSE of 0.5474200. Additionally, both models achieved relatively high R-squared values, with the XGBoost model reaching 0.7278288 and the Random Forest model reaching 0.7003313. These results suggest that both models have captured a significant portion of the variance in the target variable (cnt) and exhibit good predictive performance. Overall, the XGBoost model showcases slightly better performance, demonstrating its effectiveness in this particular scenario.
Prepare the data
set.seed(123)
train_index <- createDataPartition(bike$cnt, p = 0.8, list = FALSE)
train_data <- bike[train_index, ]
test_data <- bike[-train_index, ]
numerical_train <- train_data %>% select(cnt:wind_speed)
numerical_test <- test_data %>% select(cnt:wind_speed)
bike_proc_tr <- preProcess(numerical_train, method = c("center", "scale"))
bike_proc_te <- preProcess(numerical_test, method = c("center", "scale"))
bike_scaled_tr <- predict(bike_proc_tr, numerical_train)
bike_scaled_te <- predict(bike_proc_te, numerical_test)
train_cat <- train_data %>% select_if(is.factor)
test_cat <- test_data %>% select_if(is.factor)
train_final <- cbind.data.frame(bike_scaled_tr, train_cat)
test_final <- cbind.data.frame(bike_scaled_te, test_cat)
target_variable <- "cnt"
# Split the data into input features (X) and target variable (y)
X_train <- train_final %>% select(-target_variable)
y_train <- train_final[[target_variable]]
X_test <- test_final %>% select(-target_variable)
y_test <- test_final[[target_variable]]
# Train the Random Forest model
rf_model <- randomForest(x = X_train, y = y_train, ntree = 100, importance = TRUE)
# Make predictions on the test set
predictions <- predict(rf_model, newdata = X_test)
Evaluate our Random Forest Model
mse5 <- mean((predictions - y_test)^2)
rmse5 <- sqrt(mse5)
r_squared5 <- 1 - mse5 / var(y_test)
print(paste("MSE:", mse5))
## [1] "MSE: 0.299668665013543"
print(paste("RMSE:", rmse5))
## [1] "RMSE: 0.547420007867399"
print(paste("R-squared:", r_squared5))
## [1] "R-squared: 0.700331334986457"
Our result shows there’s a minor different between Random Forest and XGBoost whereas XGBoost is slightly better than Random Forest.
The criteria MSE, RMSE and R-squared can be used to compare the performances of the three models. The following codes provide the calculated criteria for all the models in one table:
# Calculate the evaluation metrics for each model
metrics <- c("MSE", "RMSE", "R-squared")
Regression <- c(mse, rmse, r_squared)
Decision_Tree <- c(mse2, rmse2, r_squared2)
XGBoost_model <- c(mse4, rmse4, r_squared4)
Random_Forest <- c(mse5, rmse5, r_squared5)
# Create the tibble/table
result_table <- tibble(Model = metrics,
Regression = Regression,
Decision_Tree = Decision_Tree,
XGBoost = XGBoost_model,
Random_Forest = Random_Forest)
print(result_table)
## # A tibble: 3 × 5
## Model Regression Decision_Tree XGBoost Random_Forest
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MSE 0.464 0.413 0.272 0.300
## 2 RMSE 0.681 0.643 0.522 0.547
## 3 R-squared 0.512 0.398 0.728 0.700
“The XGBoost model achieved an R-squared value of 72.78%, indicating that approximately 72.78% of the variance in the target variable can be explained by the features included in the model. This suggests a moderately strong relationship between the independent variables and the dependent variable.”
Variable importance in a decision tree model refers to the measure of how much each input variable contributes to the accuracy of the model. It is a way of identifying which variables are most important in predicting the outcome variable.
# Calculate variable importance
importance <- xgb.importance(feature_names = colnames(X_train), model = xgb_model)
print(importance)
## Feature Gain Cover Frequency
## 1: hour 0.318893907 0.098464224 0.102457535
## 2: hum 0.264324154 0.204016385 0.184857246
## 3: is_weekend 0.159577459 0.020991354 0.038127936
## 4: t1 0.112384799 0.231179158 0.179074810
## 5: wind_speed 0.031683097 0.130370876 0.170581858
## 6: month 0.028583157 0.100347618 0.070473437
## 7: weather_code 0.027189440 0.042196155 0.061438381
## 8: day 0.026454591 0.119448488 0.131550416
## 9: is_holiday 0.013133724 0.009960921 0.008312252
## 10: season 0.011069208 0.026919521 0.029815685
## 11: year 0.006706465 0.016105298 0.023310445
# Plot variable importance
xgb.plot.importance(importance_matrix = importance,
top_n = 10, # Show top 10 important features
xlab = "Feature Importance ",
ylab = " ",
main = "XGBoost Variable Importance Plot", # Plot title
col = "steelblue",
horizontal = TRUE)
Gain emphasizes the predictive power of a feature.
Cover highlights the frequency of using a feature for splitting.
Frequency reflects the prevalence of a feature in the dataset.
“The XGBoost model achieved an R-squared value of 72.78%, indicating that approximately 72.78% of the variance in the target variable can be explained by the features included in the model. This suggests a moderately strong relationship between the independent variables and the dependent variable.”